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METHOD FOR SEQUENCING POLYNUCLEOTIDES 



FIELD OF THE INVENTION 

This invention relates to computational methods in molecular biology, and 
more specifically to methods for determining the sequence of a polynucleotide. 
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BACKGROUND OF THE INVENTION 

Sequencing by hybridization (SBH) is a method for sequencing a 
polynucleotide such as a DNA molecule (Bains & Smith 1988, Lysov et al. 1988, 
Southern 1988, Drmanac and Crkvenjakov 1987, Macevics 1989). In this 
method, a chip, or microarray is used consisting of a surface upon which all 
possible oligonucleotide probes of a particular length k (referred to herein as 
"k-mers") are immobilized (Southern 1996). The DNA molecule whose sequence 
is to be determined, referred to as the "target molecule", is allowed to hybridize 
to the k-mers on the chip. The target molecule and the k-mers on the chip may all 
be single stranded molecules. Alternatively, a double stranded target may first be 
cut into fragments having single stranded "sticky ends", and the k-mers on the 
chip may be the sticky ends of double stranded molecules. Ideally, a single 
stranded target or the sticky end of a double stranded target hybridizes to a k-mer 
on the chip if and only if the sequence complementary to the k-mer occurs 
somewhere in the target sequence or the sticky end. Thus, in principle, it is 
possible to experimentally determine the "k-spectrum" of the target (the set of all 
k-long substrings present in the target). In practice, however, the data are 
ambiguous due to the ability of the target to bind to k-mers that are only partially 
complementary to one of its substrings. Thus, any binarization of the 
hybridization signal will contain errors. 

The goal of SBH is to determine the target sequence from the target 
spectrum. However, even if the target spectrum were error free, the target 
sequence is not uniquely determined by the spectrum. If the number of sequences 
consistent with the spectrum is large, there is no satisfactory method to select the 
true sequence. Theoretical analysis and simulations (Southern et al, 1992, 
Pevzner and Lipshutz 1994) have shown that even when the spectrum is errorless 
and the correct multiplicity of each k-mer in the target sequence is known, the 
average length of a uniquely reconstructible target sequence using a chip of 
8-mers is only about two hundred nucleotides, far below the length of a DNA 
molecule that may be sequenced by electrophoresis. 
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Let E = (A,C,G,T) designate the set of nucleotides composing a DNA 
molecule. M = 4 is the "alphabet size". A DNA sequence is a string over S which 
is denoted herein between braces (o). The k- spectrum of a target sequence T of 
length L, T= <ti 5 t 2 ,...t L >, is the set of all k-long substrings (k-mers) of T. For 
5 each k-mer x = <xj, x 2 ,...x k > in e Z k , we define T(x)to be 1 if x is a substring 
of T, and 0 otherwise. We denote K = M k , the number of k-mers. A hybridization 
experiment measures, for each k-mer x in e£ k , an intensity of its hybridization 
with the target. 

The result of an SBH experiment may be described by a graph in which 
10 each candidate target sequence is represented as a path in a graph (Pevzner et al, 
1989). The graph is a directed de-Bruijn graph G(V,E) whose vertices are labeled 
by all the (k-l)-mers (the set of vertices V = E kH ) 9 and its edges are labeled by 
k-mers, (the set of edges E = Z k ). The edge labeled <xi, X2...Xk> connects the 
vertex <xi, x 2 ... Xk-i> to the vertex <x 2 ...Xk>. There is a 1-1 correspondence 
15 between L-long candidate target sequences and (L - k + 1)- long paths in G, 
whose edge labels comprise the target spectrum. Hereafter, we interchangeably 
refer to edges and their labels, and also to sequences and their corresponding 
paths. 

Since k-mers may reoccur in the target sequence, the paths do not have to 
20 be simple. When the spectrum is perfect and the multiplicities of the k-mers in 
the spectrum are known, every solution is an Eulerian path (Pevzner et al 1989). 
In practice, however, the spectrum is not perfect and the multplicities are not 
known. 

Alternative chip designs (Bains and Smith 1988, Khrapko et al. 1989, 
25 Pevzner et al 1991, Preparata et al 1999, Ben-Dor et al. 1999), as well as 
interactive protocols (Skiena and Sundaram 1995) have been suggested, often 
assuming additional information, in order to reduce the ambiguity of the 
hybridization-based reconstruction. 

Nucleotide sequences from different sources may resemble each other, due 
30 to a common ancestral gene. This phenomenon is encountered within a species, 
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between duplicated regions within a genome, and between individuals within a 
population. Small differences in sequences, referred to as "Single Nucleotide 
Polymorphisms " or SNPs, efficiently serve as genetic markers that are useful in 
medicine. Thus the detection and genotyping of SNPs has become an important 
5 task of human geneticists. The evolution of homologous sequences from a 
common ancestral gene is mainly due to nucleotide substitution. Insertions and 
deletions of nucleotides are also known to have occurred during evolution of 
homologous sequences, though at lower rates. 

A DNA molecule having a known sequence and known to be homologous 
10 to a target molecule has not yet been used to reduce the ambiguity of SBH data in 
order to determine the target sequence. 

SUMMARY OF THE INVENTION 

In the following description and set of claims, two parameters are 
15 considered to be equivalent to each other if they are proportional to each other. 

The present invention provides a method for sequencing a target sequence. 
In accordance with the invention, experimental spectrum data obtained from a 
DNA chip is combined with sequence information of a reference DNA molecule. 
The reference molecule is preferably a molecule believed to be homologous with 
20 the target. For example, the target sequence may be a mutant gene and the 
reference sequence the previously sequenced normal gene. As another example, 
the target sequence may be a human gene and the reference sequence the 
homologous gene in another organism. A score is defined for each sequence in a 
set of candidate target sequences based upon a simultaneous comparison of the 
25 candidate sequence with the spectrum and with the reference sequence. A 
candidate target sequence is then selected having a essentially maximal score. 
Calculating the score does not require knowledge of the multiplicities of the 
k-mers in the k-spectrum. Moreover, unlike all prior art algorithms, the score 
does not assume that the spectrum is perfect. 
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The invention therefore provides a novel probabilistic method that handles 
imperfect hybridization data with unknown multiplicities. Thus, in accordance 
with the invention the hybridization of the target T with the k-mer on the DNA 
chip complementary to x is described by probabilities Po(x) and Pi(x) of the 
5 observed hybridization signal when T(x) = 0, and t(x) = 1, respectively The 
results of a hybridization experiment are described by the "probabilistic 
spectrum" (PS ) defined as the pair (P 0 ,Pi) of functions P { : S k -> [ 0, 1]. If the 
experiment were perfect, i.e., if Po(x) and Pi (x) are either 0 or 1 with 
Po(x)+Pi(x) 1, then the PS would represent the k-spectrum. In practice, 

10 however, Po(x) and Pi (x) are both positive. There is thus a chance l-Po(x) for a 
false positive (a k-mer (x) not occurring in T ? whose complementary sequence 
produces a hybridization signal indicative of hybridization) and a chance 1-Pi (x) 
for a false negative (a k-mer (x) occurring in T, whose complementary sequence 
produces a signal indicative of no hybridization). (When handling probabilities, 

15 some of which are perfect, problems of division by zero might occur. This is 
avoided by implicitly perturbing probabilities 0 and 1 to s and 1-g.) 

The probability of obtaining a specific spectrum PS when T is used as the 
target is referred to as the "experimental likelihood". The experimental likelihood 
is calculated assuming that the hybridization results of the target to different 

20 k-mer probes are mutually independent. In one embodiment of the invention, an 
experimental likelihood L e (f ) is used that does not assume knowledge of the 
multiplicities of each k-mer in the sequence. L e (f ) is given by: 






(i) 



25 



Taking logarithms and defining co(3c) = log 



PXx) 



we can write: 



^o(*) 
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Hence, 



logP 0 (i) if f(i)=0 
logP 0 (x) + a>(x) if f(i)=l. < 2a > 



«Z* t(iH (2b) 



The first term is a constant (independent of f ), and is omitted hereafter. 
In another embodiment, an approximate likelihood Z(f ) is used, that is 
defined as follows: Let p = e 0 , e L -k be the path in G corresponding to f and 



define 

L-k 



IS 



< 3 > 

10 D(f)=L=(f) for a path in which all edges have a multiplicity of 1 5 and 

otherwise an approximation to L e (f). L e (f) has the advantage of being easily 
computable in a recursive manner: 

logL e (e 0 , . . . e t ) = logV (e 0> ... e M ) + co(e t ) (4) 

15 In yet another embodiment, an experimental likelihood L e (f ) is used that takes into 
account the multiplicities of edges. In this case, the probabilistic spectrum consists 
of probabilities Pi(x), denoting the probability of the observed hybridization signal 
when the multiplicity of 3c in the target is i. L e (f ) is defined by: 

r{f)=Prob{ps\i)= n W*) 

20 where f (x) is the multiplicity of x in f . 



' ( 4b ) 
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Thus in its first aspect, the invention provides a method for obtaining a 
candidate sequence, the candidate nucleotide sequence being indicative of a 
sequence of a target polynucleotide molecule T, T producing a hybridization signal 
I(x) upon incubating T with a polynucleotide x for each polynucleotide x in a set 
5 E of polynucleotides, the method comprising the steps of: 

(a) for each polynucleotide x in the set E of polynucleotides, obtaining a 
probability P 0 (x) of the hybridization signal I(x) when the sequence x is not 
complementary to a subsequence of T and a probability Pi(x) of the hybridization 
signal when the sequence x is complementary to a subsequence of T; so as to 

l o obtain a probabilistic spectrum (PS) of T; 

(b) assigning a score to each of a plurality of candidate nucleotide 
sequences, the score being based upon the probabilistic spectrum and upon at least 
one reference nucleotide sequence H; and 

(c) selecting one or more candidate nucleotide sequences having an 
15 essentially maximal score. 

In its second aspect, the invention provides a program storage device 
readable by machine, tangibly embodying a program of instructions executable by 
the machine to perform method steps for obtaining a candidate nucleotide 
sequence, the candidate nucleotide sequence being indicative of a sequence of a 
20 target polynucleotide molecule T, T producing a hybridization signal I(x) upon 
incubating T with a polynucleotide x for each polynucleotide x in a set E of 
polynucleotides, the method comprising the steps of: 

(a) for each polynucleotide x in the set E of polynucleotides, obtaining a 
probability P 0 (x) of I(x) when the sequence x is not complementary to a 

25 subsequence of T and a probability Pi(x) of I(x) when the sequence x is 
complementary to a subsequence of T; so as to obtain a probabilistic spectrum (PS) 
ofT; 

(b) assigning a score to each of a plurality of candidate nucleotide 
sequences, the score being based upon the probabilistic spectrum and upon at least 

30 one reference nucleotide sequence H; and 
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(c) selecting a candidate nucleotide sequence having an essentially 
maximal score. 

In its third aspect the invention provides a computer program product 
comprising a computer useable medium having computer readable program code 
5 embodied therein for obtaining a candidate nucleotide sequence, the candidate 
nucleotide sequence being indicative of a sequence of a target polynucleotide 
molecule T, T producing a hybridization signal I(x) upon incubating T with a 
polynucleotide x for each polynucleotide x in a set E of polynucleotides, the 
computer program product comprising: 
10 (a) for each polynucleotide x in the set E of polynucleotides, computer 

readable program code for causing the computer to obtain a probability Po(3c) of 
I( x ) the sequence x is not complementary to a subsequence of T and a probability 
Pi(f ) of I( x ) when the sequence x is complementary to a subsequence of T; 

(b) computer readable program code for causing the computer to assign a 
15 score to each of a plurality of candidate nucleotide sequences, the score being 

based upon the probabilistic spectrum and upon at least one reference nucleotide 
sequence H; and 

(c) computer readable program code for causing the computer to select a 
candidate nucleotide sequence having an essentially maximal score. 

20 

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS 
First Embodiment 

In this embodiment, the unknown target sequence T = <tj ... ti> has a 
known, homologous reference sequence H = <h\ ... hi>. H and T are known to 
25 differ from each other by nucleotide substitutions without insertions or deletions 
(indels). This would be the case, for instance, when the target T is a mutant 
sequence whose wild type sequence H has already been sequenced, and one 
expects that nucleotide substitutions are the only cause of variability between H 
and T (statistically, substitutions are much more prevalent than indels (Wang et 
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al. 1998)). A set of M xM position specific substitution matrices A/ l) , are 
used, where for each position j along the sequence: 

M (i) [/,/'] = Pr ob(tj = i | hj = i ) (5) 
5 for nucleotides i and i'e£. 

The matrices A^ 0 may be the same for all j, or may different for different 
positions j. The matrices are used to calculate a distribution on the space of 
possible target sequences. This "prior distribution for unmapped homology", D u , 
is given, for each candidate target sequence T by: 

10 



D" (f ) = P rob{f | H) = n M< J > [tj , hj ] 



15 



20 



(6) 



One may recursively compute: 

£ >U {( t i— t j)) = ' 7 ^° ) [0'^y] (7) 

We denote L U) [x, y] = log M U) [x, y] . 

The probability of a candidate target sequence f , given the probability 
spectrum PS and the reference sequence H is: 

P ^^^y Prob{H).p r ob{T\H\Prob(pS\H,T) 

V 7 Prob(H,PS) (8) 

Given f , the hybridization signal is independent ofH: 



25 



Prob(pS \H,T)= Prob(pS \ f ) 
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Thus, omitting the constant Prob ( H ) we can wr j te . 

Prob(H,PS) 



P rob{T | H,Ps)= D" (f ) • If (f ) (9a) 

P rob{t \H,Ps)=D u (f ) • V (f ) (9b) 

or Prob(f\H,Ps)=D u (f)-L e (t) (9c) 

Taking logarithms, the following "ungapped scores" of a candidate target 
are obtained: 

S core? (f ) = log V (f )+ log D" (f ) (10a ) 

Score2»(i)= logZ e (f)+logZ)"(f) (10b ) 

SW3«(t)= logL e (f )+logD"(f ) (10c) 



With Score 11 ] , Score u 2 or Score u 3 , the higher the score of a sequence f , 
the more likely it is to be the target sequence. The highest scoring candidate 

15 sequence may be determined by any method known in the art. In the search for 
the highest scoring candidate sequence, complexity is preferably reduced by 
deleting from the graph edges for which L e (f) L e (f) or L e (f) is less than a 
predetermined constant. Isolated vertices corresponding to highly improbable 
(k-l)-mers, are also preferably deleted from the graph. 

20 For example, using L e (f ), the search for a high scoring candidate sequence 

may be performed by the following algorithm referred to herein as "Algorithm 
A". In accordance with Algorithm A, for each vertex y = (y l ... y k _^) e , and 
integer j = k- 1, k, k + 1, /, let S u \y,j] be the maximum score of a y'-long 
sequence ending with y aligned to (V A) - Initialize, for each y : 
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S'\p,k-l]=f l l! J, \y J .h J ] 



Loop over j = k, /, and for each vertex j> = {y x .~y k _ x ) recursively 
update: 

S"\yj} = L^[y k _ x ,h}+ m^ E) {s"[z,j -l] + w(e)) (12a) 

Finally, return: 



MAX Score" = mwL S"\y,l] (12b) 

yeV 
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A sequence T* attaining the maximal score is found from the matrix S" as 
is known in the art, for example, by saving trace-back pointers: 

p[yJh argmax {s u [z,j -1] (13a) 

Z=<Z n Z t ...Z k -_, >.£f=(f ,j?)ec ^ 

MAXPtr = arg max 5" 0, /] (13b) 

J>eK 

15 The maximum-scoring path in the graph is then followed, by setting: 

Z?=MAXPtr, and for all j = k, : 2' l =P\2j\. Denote 2'=<z / / z / 2 ...z / ^ 7 >. 

The final result is the sequence of nucleotides <z k ~ l j , z h ~ l 2y.z~ l k-i, Zk-u 

k+i i . 

Z ...Zjfc_/> 

The time complexity is O(IK), since the maximization in (12a),(13a) is a 
20 maximum of only a constant number (four) of terms. Although the complexity is 
exponential in k, it is constant for a given microarray (currently feasible values 
are k = 8 or 9). Moreover, the complexity scales linearly with the size of the 
hybridization experimental results, which are part of the input. 
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Space complexity requires a more elaborate analysis. When naively using 
this algorithm, it requires 0(lK) memory space, which is quite high for current 
technology microarrays. We now detail how we can modify the algorithm to 
reduce space complexity. 
5 Observe, that this algorithm consists of two computations: Computing the 

optimal score (equations (ll),(12a) and (12b)), and reconstructing the optimal 
sequence (equations (13a) and (13b)). The first task, of computing the optimal 
score alone, is space-efficient: it can be accomplished using space which is linear 
in the (effective) size of the hybridization experimental data, that is, O(fKJ) 
10 space. 

By following the paradigm of Hirschberg (Hirschberg 1975), for example, 
for linear-space pair-wise alignment, a version of the algorithm is obtained which 
requires only linear space. The reduced space complexity is traded for time 
complexity, which increases by an 0(log I) factor. 
15 For each position j = /, l-l, k, k-l 9 the score of the entire sequence is 

decomposed. The total score is represented as a sum of two expressions: the 
contribution of its (j - k + i)-prefix, which equals the score of this prefix 
computed by S", plus the contribution of the corresponding suffix. Formally, for 
each vertex y = {y ] ...y k _ ] )eV, let R u [y,j] be the maximum contribution to the 

20 score of a (/ -j + k - 7)-long sequence beginning with y aligned to {h hk+2 ...h l } . 
Initialize, for each y : 

R"\?,i] = o (H) 

25 Loop over j = I - 1, I - 2, k - 7, and for each vertex = (;>,... JV,) 

recursively update: 



• # 
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Observe that, for all k - 1 <j < I 

MAX Score" = max {s u \y,j] + R u [y, j] 



y e y v (16) 

5 

Equation (16) can be used to decompose the problem into two similar 
problems, of half its size. Recursively solving these sub-problems gives a 
divide-and-conquer approach for finding the optimal sequence. The linear space 
algorithm is therefore as follows: 
10 1 . If the length / of the target is smaller than some constant C, for 

example, 25 nucleotides: 

Solve the problem directly, according to the dynamic 
program of Equations (1 1), (12a), (12b), (13a) and (13b). 

Otherwise, 

o c + l + k-l 
15 2. Set m = . 

2 

3. For each j = k - 1, m: 

Compute S u [y, j] (following equations (11) and (12a)) for all y , 
re-using space. 

4. For each j = /, / - l f m: 

20 Compute R u [y,j] (following equations (14) and (15)) for all y , 

re-using space. 

5. Find y m = argmax\s u [y,m]+ R u [y,m] , 

y<=V 

thereby computing: MAX Score 11 , by (16). 

6. Recursively compute: 

25 (a) The optimal sequence aligned to (h } ...h m ) 

ending with y m . 
(b) The optimal sequence aligned to (/z OT ...^) 
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beginning with y m . 

Observe, that for each y, j, the values of iS fM [^y]and ^"[j^yjare computed 
a total of log / times. Thus the algorithm takes 0{[K]l log I) time and O(fKJ) 
5 space, using the effective spectrum. 

Second Embodiment: substitutions and deletions 

In this embodiment, the unknown target sequence T ~{t y ...t r ) differs from 

the reference H = (h v ..h { ) , by substitutions and deletions only, without insertions. 

10 Denote the probability of initiating a gap right before hj (aligning hj to 

space) is 2° J . Similarly, $ is the logarithm of the probability for gap extension at 
hj. Also define =log(l-2 Pj )& j =log(l-2° j ). To overcome boundary problems 
at the ends of the sequence, we extend the alphabet by including left and right 
space characters: S = 2 U {>, <} . We augment the reference sequence by the string 

15 > k on its left and < k on the right. We extend the substitution matrix by using 
probabilities that force alignment of each of > and <to itself. Formally, we 
define: 



=Z A " 1 U {xz | x=>',£e2: *"'-'} 
U {zx\z eE y ,x=<*- , - y } 



We arbitrarily set co(y) to 0 for each yeE k Ms k ! . Thus, the weighted 

de-Bruijn graph is naturally extended over E k ~ l , and so is [G] = ([V], [E]), its 
effective subgraph. Hereafter, we use the notation [G] for the extended graph. As 
with the previous embodiment, in order to reduce complexity, edges for which 
25 L e (f) or L e (f) is less than s are preferably deleted from the graph. Isolated 
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vertices corresponding to highly improbable (k-l)-mers 5 are also preferably 
deleted from the graph. 

The search for a high scoring candidate sequence may be performed by the 
following algorithm referred to herein as "Algorithm B". In accordance with 
5 Algorithm B, for each y = (y I ...y k _,)e [V], j = k = 1, k, k + 1,..., 1, S d \yj] is 
defined as the maximum score of aligning a sequence ending with y to (VA) 
where hj is aligned to a gap (and y k _] is aligned to some hj...hj). Further T d [x,j] 
is defined as the maximum score of aligning a sequence ending with (y } ...y k _ } ) 9 
to (/?,... where hj aligned toyt-h Initialize, for each y : 



10 



S d \?,k-l] = -<o; 



(19) 



T d \y,k-l]=< 



k-\ 



0 y =>' 
— oo otherwise 



(20) 



15 Loop overy = k, /, and for each y = {y,...y k _,)e. , [V], recursively 

update: 



S d [y, j] = max {r d [y, + 0j , S d \y, + f3 j 



(21) 



20 



Td [yj] = Lij) [yk-v h \+ max \(o(e)+max 



T d [z,j-l} + a; 



(22) 



Finally, return: 



MAXScore d = T d [< k ~ x ,l] 



(23) 



The complexity of this algorithm is still O(lfKJ) and a linear space variant 
can be obtained, as described in the previous embodiment. A sequence T* 
attaining the maximal score is then formed from the matrix T 1 as is known in the 
art, for example, by saving trace-back pointers to follow the maximally scoring 
path in analogous manner to that described in the previous embodiment. 

Third Embodiment: Substitutions, Deletions and Insertions. 

In this embodiment, a target sequence is determined when the target is 
known to be obtained from the reference by substitutions, insertions and 
deletions. The algorithm is an extension of the dynamic programs of the previous 
embodiments. 

Denote by Tj the target prefix whose last nucleotide is aligned to hj in the 
reference sequence. Further denote by aj (respectively bj) the log-probability of 
initiating (extending) an insertion in the target after Tj, and define 
a J =l-a J ,S J =l-b J . 

Consider the weighted graph (G,to). Define the Kx ^matrix Was follows: 



2 w(?) The (k-l)- suffix of x 



W[x,y} = < 



is the (k - 1) - prefix of y 
0 Otherwise 



(24) 



W [x, y] is thus the probability of moving from 3c to y along / edges. The 
probability of an insertion of length / after Tj is djb'jbj . Suppose that the prefix 
Tj ends with x. Then a J b'7 l b J W'[x,y] i s the probability of Tj+i ending with y 
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and being i nucleotides longer than Tj- The matrix W governing the stochastic 
progression from Tj to Tj+i is calculated as follows: 

"" ' <'./>,""V; ; w,,/.;H"ft,... 

5 

= a.W + a :b ,b W 2 Tb^W 1 ' 2 

J J J J J 

i>2 

= a J W + a J b J b J W'{l-b J WY 

10 A new weighted graph (G^co') is now defined as follows. The vertex set of 

G is also the vertex set of G'. The edge set E f of G is the set of all pairs x,y with 
PF'[x ; j?]>0. Each such edge e = [x,j/] is associated with a weight w'(e) = log 
W%y]. 

The search for a high scoring candidate sequence may be performed by the 
15 following algorithm referred to herein as "Algorithm C". In accordance with 
Algorithm C 5 Algorithm B of the second embodiment is applied to (G' oj) instead 
of (G, 0)). 

In contrast to G, degrees in G' are not bounded by 4. Therefore, computing 
each dynamic program cell has complexity 0(K) in the worst case, with the total 

20 complexity of the algorithm being 0{l\E f \). Again, considering only the effective 
size of the graph allows more efficient computation, taking 0{l\[E r \\). 
Fourth Embodiment: Substitutions, Deletions and Insertions. 

In this embodiment, homology between nucleotide sequences is described 
by Hidden Markov Models (HMMs) using a set Q of Markov chain states with a 

25 predefined set of allowed transitions between them. For each level (position 
along the sequence) j = 1, l Q , Q includes three states: Mj (match), Ij (insert), 
and Dj (delete). Mj and Dj can be reached from the three (J A) (th) level states. Ij 



(25) 
(26) 
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can be reached from the three (j)-(th) level states (including a self-loop). 
Transition probabilities are as described in previous sections, e.g., aj = 
Prob{M t Additionally, each insert or match state, q, induces a vector of 

emission probabilities Af, where Af[i] is the probability that the target nucleotide 
5 is L We denote L q [i] = 0 for q = Dj, L q [i] = log Af[i] otherwise. We write lpb(X) = 
log Prob{X) for short. 

The search for a high scoring candidate sequence may be performed by the 
following algorithm referred to herein as "Algorithm D". In accordance with 
Algorithm D, a three dimensional array S is defined, where for each 
io q^Q,y = (.Vi — >V-i ) G ty\ r - k>'~> L> S[q,y,r] is defined as the maximum score of 
an r-long sequence ending with (y } ... JVi ) > whose alignment to the profile ends in 
q. Thus, initialize: 



S[q start . > k ~\k-\]=Q 



(28) 



15 



S[q, y, k — l] = —oo for other 

values of y, q ^ 29 ^ 



Loop over r = k, ... 1, and for each jJ= (>',...>'*_,) e [v\ r < l Q , recursively 
update: 

20 

S[q, y, r\ = 13 \y k _ x ] + jnax^Sfe', z, r - 1] + lpb(q' h-> q) + O) (e) } (30) 
Finally, return: 



MAX Score = max{s[q endi < k ~ l , /] 



(31) 
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A sequence T* is maximal score is then found in a manner similar to that 
described in the previous embodiments. 

This algorithm requires 0(l Q . [K] ■ L) time and space, where L is an upper 
bound on the size of the target sequence. As with the previous embodiments, the 
5 complexity of this algorithm can be reduced to 0(l Q .[K] L log L) time and 0(l Q 
[K]) memory. Furthermore, one can consider the dynamic program as filling a l Q 
x L matrix, with a |X]-long vector in each matrix cell. Since all values far from 
the main diagonal of this matrix should be negligible, preferably only values 
within a distance less than a predetermined constant R to the main diagonal are 
10 calculated, reducing the complexity to 0(R(l Q +L) [K] -log L) time and 
0{R{l Q +L) -[K]) space. 



Fifth Embodiment: Summation over all paths 

In this embodiment the graph nodes (HMM states and k-mers) that are 
15 most likely to be visited at a certain position along the target sequence are 
obtained. The "Forward-Backward" algorithm is used (see, e.g., Durbin et al., 
1998) providing the likelihood summed over all paths entering a node, instead of 
the likelihood of the maximum path. The only change to the equation presented 
thus far is that max operators are changed into log-sum-of-exponents. More 
20 specifically, equations (12a), (12b), (15), (16), (20), (21), (29), and (30) are 
re- written, respectively, as follows: 

e=(z,y<=E) 



MAXScore u = log^exp^j),/]) 



(12b') 



R " [y, J] = log X exp (R - [z, j + + co(e) + L™[z k _, , h J+x ]) 



e={y,z)EE 



05') 



25 
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^l3>,7] = Mex P (r^,y-l] + aJ + exp(^|j j y-l] + ^.)) 



(20') 



T d \y J)= L(j) ^-^\ +{ ^J>Me)) + 

S[q,y,r} = L«[y k _ x ]+ ^exp^'J^ -l]+lpb(q' q)+a)(e)) fm 



MAX Score = log^exp^, <*"»,/]) 



(30') 



Sixth Embodiment: Enhancements 

In this embodiment the exact likelihood calculated according to Equation 
10a of several top-scoring candidates found using the approximated likelihood 
(Equation 10b) is calculated. These sequences are then re-ranked. This 2-phase 
io filtering is more discriminative than approximated scoring, while still tractable 
using the formulae presented. 

If the score of a dynamic programming cell is very low, that cell probably 
does not participate in the maximum solution. This allows discarding such cells, 
without risking loss of the true optimum. Computing time and space may thus be 
1 5 saved. 

The invention may be used for simultaneously re-sequencing several short 
targets, instead of a single long sequence. This scenario arises when considering 
many exons of a single gene. The invention may also be generalized to deal with 
DNA chips that do not contain the set of all k-mers. 
20 When the set of oligonucleotides on the microarray is not the set of all 

k-mers, a graph is constructed consisting, as vertices, instead of all the 
(k-l)-mers, all the prefixes and suffixes of oligonucleotides on the microarray. 
Edges in this graph connect two vertices if there is one base pair suffix (suffix) 
addition to one of them, that makes the other its proper suffix (prefix). The 
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scoring mechanism remains the same. This also applies for oligonucleotides 
containing "gaps" or "universal bases" (Preparata et al., 1999). 

The invention may be used also for sequencing polypeptides. Given a 
polypeptide chain homologous to a target, and given a collection of probabilities 
of occurrence of sub-chains along the target, our algorithms will find the optimal 
candidate target sequence. 
Example 

The invention was implemented and tested on simulated data. Nucleotide 
substitutions were equiprobable and insertions and deletions were not allowed. 
As a reference sequence, prefixes of the gene-rich human mitochondrial 
sequence, (Accession Number JO 141 5) were used. For each reference sequence, 
the following were generated: 

1 . A target sequence generated according to a prescribed probability q 
of substitution, defining the matrix M as 1-q on the diagonal and q/3 
elsewhere. 

2. An 8-spectrum for the target was generated using the probabilistic 
spectrum defined by P i {x)=\-p if T(x) = i , where p is a fixed probability. 

All probabilistic parameters were constant, i.e., position/£-mer 
independent. For each 8-spectrum and target sequence, candidate 
sequences were scored using Eq. (10), and a candidate sequence of 
maximal score was found. 

The algorithm was implemented in C++ and executed on Linux and SGI 
machines. Running times, on a Pentium 3, 600MHz machine, were roughly 0.12/ 
log / seconds for an /-long reference sequence (ranging from roughly 7 minutes 
for a 500bp-long sequence to 2.5 hours for 6Kb). Only the main memory was 
used, with the application consuming at most 40Mb.The graph was not reduced 
to its effective size. This would have reduced both space and time dramatically, at 
the expense of possibly missing the truly maximal scoring sequence. 

The performance of the algorithm was quantified by the following figures 
of merit: 
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1 . Full success rate-The fraction of runs for which the target sequence 
was perfectly reconstructed. 

2. e-success rate - The fraction of runs for which the target sequence 
of length 1 was reconstructed with fewer than £•! nucleotide errors. 

3. Average sequencing error - The fraction of nucleotide errors. 
Table 1 presents results for a scenario of distinct, but closely related 

sequences, e.g., orthologous genes in a pair of primates. We assume perfect 
hybridization data with 97% sequence similarity (that is q=0.03). The results 
show that sequences of length up to 2000 can be reconstructed almost perfectly. 
The non-monotonicity of the figures of merit with respect to the target length is 
probably due to sequence contents. 

Table 2 presents results for a scenario of SNP-genotyping. The rate of 
SNPs is assumed to be 1 :700 (Wang et al 1998), and p=2% was used. The results 
show that a high success rate is achievable even in the presence of spectrum 
errors. 



Table 1 



Length 


# runs 


% full 
success 


% s-success 
s=10" 3 8 = 2-10" 3 


% avg. 
error 


500 


10 


100 


100 


100 


0.000 


1000 


10 


100 


100 


100 


0.000 


1500 


10 


100 


100 


100 


0.000 


2000 


17 


94 


94 


94 


0.003 


2500 


13 


46 


53 


69 


0.295 


3000 


14 


71 


78 


78 


0.488 


3500 


5 


0 


20 


20 


4.091 


4000 


13 


76 


84 


84 


2.173 


4500 


11 


9 


18 


45 


0.091 


5000 


15 


0 


13 


53 


4.149 


5500 


7 


14 


28 


71 


0.119 



Table 1 : Results on simulated date, for different sequence lengths, assuming 
97% sequence similarity between the target and the reference, and perfect 
hybridization data. 
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Table 2 



Length 


# runs 


% full 
success 


% s-success 
e=10" 3 8 = 2-10" 3 


*Yo 3Vg. 

error 


250 


10 


100 


100 


100 


0.000 


500 


10 


100 


100 


100 


0.000 


750 


10 


90 


90 


100 


0.013 


1000 


10 


90 


90 


90 


0.010 


1250 


10 


90 


100 


100 


0.032 


1500 


12 


91 


100 


100 


0.033 


1750 


10 


60 


80 


80 


0.109 


2000 


10 


60 


90 


90 


4.965 


2500 


10 


0 


80 


100 


10.312 


3000 


10 


30 


70 


90 


0.230 



Table 2: Results on simulated data, for different sequence lengths, assuming 
5 p ~ 2% error the hybridization data, with 1 :700 sequence difference. 

It will also be understood that the system according to the invention may 
be a suitably programmed computer. Likewise, the invention contemplates a 
computer program being readable by a computer for executing the method of the 
io invention. The invention further contemplates a machine-readable memory 
tangibly embodying a program of instructions executable by the machine for 
executing the method of the invention. 



